We further show that the oscillation is in fact controlled by the balanced parameter attached to the reconstruction loss, which provides a theoretical foundation for parameterizing it in backpropagation. The oscillation occurs only when the gradient has a magnitude large enough to change the sign of the latent weight. Consequently, we calculate the balanced parameter from the maximum magnitude of the weight gradient in each iteration, leading to resilient gradients and effectively mitigating weight oscillation.
3.9.1 Problem Formulation
Most existing implementations simply follow previous studies [199, 159] and optimize $A$ and the latent weights $W$ through a nonparametric bilevel optimization:
$$W^{*} = \arg\min_{W} \; \mathcal{L}(W; A^{*}), \quad (3.139)$$
$$\text{s.t.} \quad \alpha^{n*} = \arg\min_{\alpha^{n}} \; \|\mathbf{w}^{n} - \alpha^{n} \circ \mathbf{b}_{\mathbf{w}^{n}}\|_{2}^{2}, \quad (3.140)$$
where $\mathcal{L}(\cdot)$ represents the training loss. Consequently, a closed-form solution of $\alpha^{n}$ can be derived by the channelwise absolute mean (CAM) as $\alpha^{n}_{i} = \frac{\|\mathbf{w}^{n}_{i,:,:,:}\|_{1}}{M^{n}}$, with $M^{n} = C^{n}_{in} \times K^{n} \times K^{n}$.
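For concreteness, the CAM solution can be sketched in a few lines of PyTorch-style code. This is a minimal sketch, assuming a $(C_{out}, C_{in}, K, K)$ weight layout; the function name is illustrative and not taken from the source.

```python
import torch

def binarize_cam(w):
    """Binarize a 4-D latent weight tensor (C_out, C_in, K, K) and compute the
    channelwise absolute mean (CAM) scaling factor of Eq. (3.140).

    Returns (alpha, b_w, w_hat), where w_hat = alpha * sign(w) approximates w.
    """
    c_out = w.shape[0]
    m = w[0].numel()                                   # M^n = C_in * K * K
    # alpha_i = ||w_i||_1 / M^n  (closed-form CAM solution)
    alpha = w.abs().reshape(c_out, -1).sum(dim=1) / m  # shape (C_out,)
    # binary weights in {-1, +1}
    b_w = torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))
    w_hat = alpha.view(-1, 1, 1, 1) * b_w              # reconstructed weights
    return alpha, b_w, w_hat
```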
For ease of representation, we use $\mathbf{w}^{n}_{i}$ as an alternative to $\mathbf{w}^{n}_{i,:,:,:}$ in the following. The latent weight $\mathbf{w}^{n}$ is updated using a standard gradient backpropagation algorithm, and its gradient is calculated as:
$$\delta_{\mathbf{w}^{n}_{i}} = \frac{\partial \mathcal{L}}{\partial \mathbf{w}^{n}_{i}} = \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{w}}^{n}_{i}} \, \frac{\partial \hat{\mathbf{w}}^{n}_{i}}{\partial \mathbf{w}^{n}_{i}} = \alpha^{n}_{i} \, \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{w}}^{n}_{i}} \circledast \mathbf{1}_{|\mathbf{w}^{n}_{i}| \le 1}, \quad (3.141)$$
where $\circledast$ denotes the Hadamard product and $\hat{\mathbf{w}}^{n} = \alpha^{n} \circ \mathbf{b}_{\mathbf{w}^{n}}$.
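A minimal sketch of this forward/backward rule as a custom autograd function is given below, assuming a PyTorch-style interface; the class name is illustrative, and $\alpha^{n}$ is treated as the fixed nonparametric CAM factor from Eq. (3.140).

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sketch of Eq. (3.141): the forward pass outputs w_hat = alpha * sign(w);
    the backward pass scales the incoming gradient by alpha and masks it where
    |w| > 1 (clipped straight-through estimator)."""

    @staticmethod
    def forward(ctx, w, alpha):
        ctx.save_for_backward(w, alpha)
        b_w = torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))
        return alpha.view(-1, 1, 1, 1) * b_w

    @staticmethod
    def backward(ctx, grad_w_hat):
        w, alpha = ctx.saved_tensors
        # delta_w_i = alpha_i * dL/dw_hat_i, masked by the indicator 1_{|w_i| <= 1}
        mask = (w.abs() <= 1).to(grad_w_hat.dtype)
        grad_w = alpha.view(-1, 1, 1, 1) * grad_w_hat * mask
        return grad_w, None  # alpha is nonparametric here, so it gets no gradient
```

Calling `BinarizeSTE.apply(w, alpha)` in place of `w` inside a convolution reproduces the clipped straight-through gradient of Eq. (3.141) during backpropagation.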
Discussion. Equation (3.141) shows that the weight gradient mainly comes from the nonparametric $\alpha^{n}_{i}$ and the gradient $\frac{\partial \mathcal{L}}{\partial \hat{\mathbf{w}}^{n}_{i}}$. The term $\frac{\partial \mathcal{L}}{\partial \hat{\mathbf{w}}^{n}_{i}}$ is obtained automatically in backpropagation and becomes smaller as the network converges; however, $\alpha^{n}_{i}$ is often magnified by the trimodal distribution [158]. Therefore, the weight oscillation originates mainly from $\alpha^{n}_{i}$. Given a single weight $w^{n}_{i,j}$ ($1 \le j \le M^{n}$) centered around zero, the gradient $\frac{\partial \mathcal{L}}{\partial w^{n}_{i,j}}$ is misleading due to the significant gap between $w^{n}_{i,j}$ and $\alpha^{n}_{i} b_{w^{n}_{i,j}}$. Consequently, bilevel optimization leads to frequent weight oscillations. To address this issue, we reformulate the traditional bilevel optimization using a Lagrange multiplier and show that a learnable scaling factor is a natural training stabilizer.
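The source gives no code for quantifying this oscillation, but one simple way to observe it during training is to count sign flips of the latent weights between consecutive iterations; the helper below is purely illustrative.

```python
import torch

def sign_flip_ratio(w_prev, w_curr):
    """Fraction of latent weights whose sign flipped between two consecutive
    iterations; frequent flips of near-zero weights correspond to the weight
    oscillation discussed above (illustrative helper, not from the source)."""
    flipped = (torch.sign(w_prev) * torch.sign(w_curr)) < 0
    return flipped.float().mean().item()
```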
3.9.2 Method
We first give the learning objective as follows:
$$\arg\min_{W, A} \; \mathcal{L}(W, A) + \mathcal{L}_{R}(W, A), \quad (3.142)$$
where $\mathcal{L}_{R}(W, A)$ is a weighted reconstruction loss defined as:
$$\mathcal{L}_{R}(W, A) = \frac{1}{2} \sum_{n=1}^{N} \sum_{i=1}^{C_{out}} \gamma^{n}_{i} \, \|\mathbf{w}^{n}_{i} - \alpha^{n}_{i} \mathbf{b}_{\mathbf{w}^{n}_{i}}\|_{2}^{2}, \quad (3.143)$$
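A minimal sketch of this weighted reconstruction loss is given below, assuming per-layer lists of latent weights, CAM factors, and balanced parameters. The rule for deriving $\gamma$ from the maximum weight-gradient magnitude is only stated informally above, so $\gamma$ is treated as an input here; all names are illustrative.

```python
import torch

def reconstruction_loss(weights, alphas, gammas):
    """Sketch of the weighted reconstruction loss L_R in Eq. (3.143).

    weights: list of 4-D latent weight tensors w^n, one per binarized layer
    alphas:  list of per-output-channel scaling factors alpha^n, shape (C_out,)
    gammas:  list of per-output-channel balanced parameters gamma^n, shape (C_out,)
    """
    loss = 0.0
    for w, alpha, gamma in zip(weights, alphas, gammas):
        b_w = torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))
        w_hat = alpha.view(-1, 1, 1, 1) * b_w
        # per-channel squared reconstruction error, weighted by gamma_i^n
        err = ((w - w_hat) ** 2).flatten(1).sum(dim=1)  # shape (C_out,)
        loss = loss + 0.5 * (gamma * err).sum()
    return loss
```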